Abstract 

To store and search genomic databases efficiently, researchers have recently 
started building compressed self-indexes based on grammars. In this paper 
we show how, given a straight-line program with r rules for a string S[1..n] 
whose LZ77 parse consists of z phrases, we can store a self-index for S in 
O(r + z log log n) space such that, given a pattern P[1..m], we can list the occ 
occurrences of P in S in O(m^2 + occ log log n) time. If the straight-line program 
is balanced and we accept a small probability of building a faulty index, then we 
can reduce the O(m^2) term to O(m log m). All previous self-indexes are larger 
or slower in the worst case. 

Keywords: compressed self-indexes, grammar-based compression, Lempel-Ziv 
compression 



1. Introduction 

With the advance of DNA-sequencing technologies comes the problem of 
how to store many individuals' genomes compactly but such that we can search 
them quickly. Any two human genomes are 99.9% the same, but compressed 
self-indexes based on compressed suffix arrays, the Burrows-Wheeler Transform 
or LZ78 (see pLj for a survey) do not take full advantage of this similarity. 
Researchers have recently started building self-indexes based on context-free 
grammars (CFGs) and LZ77 [7, which better compress highly repetitive strings. 
A compressed self-index stores a string S[1..n] in compressed form such that, 
first, given a position i and a length ℓ, we can quickly extract S[i..i + ℓ - 1] and, 
second, given a pattern P[1..m], we can quickly list the occ occurrences of P in 
S. 

Figure 1: A balanced SLP for abaababaabaab (left) and the corresponding parse tree (right). 

Claude and Navarro [3] gave the first compressed self-index based on 
grammars or, more precisely, straight-line programs (SLPs). An SLP is a 
context-free grammar (CFG) in Chomsky normal form that generates only one 
string. Figure 1 shows an example. They showed how, given an SLP with r rules 
for a string S, we can build a self-index that takes O(r) space and supports 
extraction in O((ℓ + h) log r) time and pattern matching in 
O((m(m + h) + h occ) log r) time, respectively, where h is the height of the 
parse tree. Our model is the word RAM with Θ(log n)-bit words; except where 
stated otherwise, by log we mean log_2 and we measure space in words. The same 
authors [1] recently gave a self-index that has better time bounds and can be 
based on any CFG generating S and only S. Specifically, they showed how, given 
such a CFG with r' distinct terminal and non-terminal symbols and R symbols on 
the righthand sides of the rules, we can build a self-index that takes O(R) 
space and supports extraction in O(ℓ + h log(R/h)) time and pattern matching in 
O(m^2 log(log n / log r') + occ log r') time. 
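
As a rough illustration of what an SLP is and how extraction can use stored 
expansion sizes, here is a minimal Python sketch. The rule names X1..X7 and the 
particular grammar are assumptions made for this example (they generate the 
running example abaababaabaab, but are not necessarily the rules drawn in 
Figure 1), and the naive top-down descent below is not Claude and Navarro's 
data structure; it only visits on the order of ℓ + h parse-tree nodes. 

```python
# Hypothetical SLP in Chomsky normal form: each rule is either a single
# terminal character or a pair of non-terminals.
RULES = {
    "X1": "b",
    "X2": "a",
    "X3": ("X2", "X1"),   # ab
    "X4": ("X3", "X2"),   # aba
    "X5": ("X4", "X3"),   # abaab
    "X6": ("X5", "X4"),   # abaababa
    "X7": ("X6", "X5"),   # abaababaabaab
}


def expansion_sizes(rules):
    """Return a dict mapping each non-terminal to the length of its expansion."""
    memo = {}

    def size(x):
        if x not in memo:
            rhs = rules[x]
            memo[x] = 1 if isinstance(rhs, str) else size(rhs[0]) + size(rhs[1])
        return memo[x]

    for x in rules:
        size(x)
    return memo


def extract(rules, size, start, i, length):
    """Return expansion(start)[i : i + length] (0-based) without expanding everything."""
    out = []

    def walk(x, lo, hi):
        # Collect positions lo..hi-1 of the expansion of x.
        if lo >= hi:
            return
        rhs = rules[x]
        if isinstance(rhs, str):
            out.append(rhs)
            return
        left, right = rhs
        m = size[left]
        walk(left, lo, min(hi, m))
        walk(right, max(lo - m, 0), hi - m)

    walk(start, i, min(i + length, size[start]))
    return "".join(out)
```

For instance, with `size = expansion_sizes(RULES)`, the call 
`extract(RULES, size, "X7", 3, 5)` descends only into the subtrees that overlap 
positions 3 to 7 of the string. 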

If we are not concerned about the constant coefficient in the space bound, 
we can improve Claude and Navarro's time bound for extraction. Calculation 
shows that h log(R/h) ≥ log n. Given a CFG generating S and only S with R 
symbols on the righthand sides of the rules, we can turn it into an SLP with 
O(R) rules (although the number of distinct symbols and the height of the parse 
tree can each increase by a factor of O(log n)). Bille et al. [5] showed how we can 
store such an SLP in O(R) space and support extraction in O(ℓ + log n) time. 
Combining their result with Claude and Navarro's improved one, we obtain an 
index that still takes O(R) space and O(m^2 log(log n / log r') + occ log r') time 
for pattern matching but only O(ℓ + log n) time for extraction. 

In this paper we show how, given an SLP for S with r rules, we can build a 
self-index that takes O(r + z log log n) space, where z is the number of phrases 
in the LZ77 parse of S, and supports extraction in O(ℓ + log n) time and pattern 
matching in O(m^2 + occ log log n) time. Therefore, by the observations above, 
given a CFG generating S and only S with R symbols on the righthand sides 
of the rules, we can build an index with the same time bounds that takes 
O(R + z log log n) space. 

If we are given a balanced SLP for S (i.e., one for which the parse tree is 
height- or weight-balanced [B]) and we accept a small probability of building 
a faulty index, then we do not need Bille et al.'s result to extract in 
O(ℓ + log n) time and we can reduce the time bound for pattern matching 
to O(m log m + occ log log n). Rytter [7] showed how we can build such an 
SLP with O(z log(n/z)) rules, and proved that no SLP for S has fewer than z 
rules. His algorithm still has the best known approximation ratio even when the 
SLP need not be balanced, but performs badly in practice. Recently, however, 
Maruyama, Sakamoto and Takeda [8] gave a practical online algorithm that 
produces a balanced SLP with O(z log^2 n) rules. In other words, requiring the 
SLP to be balanced is a reasonable restriction both in theory and in practice. 

Table 1 summarizes Claude and Navarro's bounds and our own. Since all the 
self-indexes mentioned can be made to support extraction in O(ℓ + log n) time 
without increasing their space usage by more than a constant factor, we do not 
include this bound in the table. As noted above, given a CFG generating S and 
only S with R symbols on the righthand sides of the rules, we can turn it into an 
SLP with O(R) rules, so our first result is as general as Claude and Navarro's; 
the r in the second row of the table can be replaced by R. By Rytter's result, 
we can assume z ≤ r = O(z log(n/z)). 

Table 1: Claude and Navarro's bounds and our own. In the first row, R is the number of 
symbols on the righthand sides of the rules in a given CFG generating S and only S, and r' 
is the number of distinct terminal and non-terminal symbols in that CFG. In the second and 
third rows, r is the number of rules in a given SLP for S (which must be balanced in the 
third row) and z is the number of phrases in the LZ77 parse of S. 

There are other self-indexes optimized for highly repetitive strings but 
comparing ours against them directly is difficult. For example, Do et al.'s [9] 
space bound is in terms of the number of phrases in a new variant of the LZ77 
parse [10], which can be much larger than z; Huang et al.'s [11] is bounded 
in terms of the number and length of common and distinct regions in the text; 
Maruyama et al.'s [12] time bound for pattern matching depends on "the number 
of occurrences of a maximal common subtree in [the edit-sensitive parse] 
trees of P and S"; Kreft and Navarro's [13] time bound depends on the depth 
of nesting in the LZ77 parse. 

We still use many ideas from Kreft and Navarro's work, which we describe 
in Section 2. In Section 3 we show how, given an SLP for S with r rules, we can 
build a self-index that takes O(r + z log log n) space and supports extraction in 
O(ℓ + log n) time and pattern matching in O(m^2 + occ log log n) time. We also 
show how, with the same self-index, in O(m^2 log log n) time we can compute 
all cyclic shifts and maximal substrings of P that occur in S. In Section 4 we 
show how, if the SLP is balanced and we accept a small probability of building 
a faulty index, then we can reduce the time bound for pattern matching to 
O(m log m + occ log log n). Finally, in Section 5 we discuss directions for future 
work. 

Figure 2: The LZ77 parse of "abaababaabaab" (left) and the locations of the phrase sources 
plotted as points on a grid (right). In the parse, horizontal lines indicate phrases' sources, 
with arrows leading to the boxes containing the phrases themselves. On the grid, a point's 
horizontal coordinate is where the corresponding source starts, and its vertical coordinate is 
where the source ends. Notice that a phrase source S[i..j] covers a substring S[i'..i' + m - 1] 
if and only if the point (i, j) is above and to the left of the point (i', i' + m - 1). 

2. Kreft and Navarro's Self-Index 

The LZ77 compression algorithm works by parsing S from left to right into 
z phrases: after parsing S[1..i - 1], it finds the longest prefix S[i..j - 1] of 
S[i..n] that has occurred before and selects S[i..j] as the next phrase. If j = i 
then the phrase consists only of the first occurrence of a character; otherwise, 
the leftmost occurrence of S[i..j - 1] is called the phrase's source. Figure 2 
shows an example. 
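
The parsing rule above can be sketched in a few lines of Python. This is a 
quadratic-time illustration only, and it assumes the non-self-referential 
variant (a source must lie entirely inside the already parsed prefix), which is 
the variant considered later in this section; the function name is made up for 
the example. 

```python
def lz77_parse(s):
    """Return the list of LZ77 phrases of s (non-self-referential variant)."""
    phrases = []
    i, n = 0, len(s)
    while i < n:
        k = 0
        # Extend the match while s[i : i + k + 1] occurs entirely inside s[:i].
        while i + k < n and s.find(s[i:i + k + 1], 0, i) != -1:
            k += 1
        # The phrase is the longest previously seen prefix plus one more
        # character (unless the string ends first).
        j = min(i + k, n - 1)
        phrases.append(s[i:j + 1])
        i = j + 1
    return phrases
```

On the running example, `lz77_parse("abaababaabaab")` splits the string into 
the six phrases a, b, aa, bab, aabaa and b. 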

Alstrup, Brodal and Rauhe [14] described a data structure for pattern 
matching in dynamic strings that supports non-destructive splitting and 
concatenation of substrings in time polylogarithmic in the current length of 
the string. It follows that, given the LZ77 parse of S, we can build a 
compressed index for S in z log^O(1) n time and space. Nevertheless, Kreft and 
Navarro [13] gave the first (and, so far, only) practical compressed self-index 
based on LZ77. Their index takes O(z log n) + o(n) bits and supports extraction 
in O(ℓd) time and pattern matching in O(m^2 d + (m + occ) log z) time, where 
d ≤ z is the depth of nesting in the parse. They considered only the 
non-self-referential version of LZ77, so z ≥ log n; for ease of comparison, we 
do the same. They also gave a variant of LZ77 called LZ-End, with which they 
can reduce the extraction time to O(ℓ + d). Although they showed that LZ-End 
performs well in practice, they were unable to bound the worst-case size of the 
LZ-End parse in terms of z. 

Kreft and Navarro start by building two Patricia trees, one for the reverses 
of the phrases in the LZ77 parse and the other for the suffixes of S that start at 
phrase boundaries. A Patricia tree [15] is a compacted trie for substrings of a 
stored string, in which we store only the first character and length of each edge 
label; the leaves store pointers into the string itself such that, after finishing a 
search at a node in the tree, we can verify that the node's path label matches the 
string we seek. The total size of the two Patricia trees is O(z). Since Kreft and 
Navarro store S in compressed form, they extract nodes' path labels in order 
to verify them. For example, if S = abaababaabaab, then the reverses of the 
phrases are shown below on the left with the phrase numbers, in order by phrase 
number on the left and in lexicographic order on the right; the suffixes starting 
at phrase boundaries are shown on the right. When building the Patricia trees, 
we treat S as ending with a special character $ lexicographically less than any in 
the alphabet, and each reversed phrase as ending with another special character. 
Figure 3 shows the Patricia trees for this example. 

Their next component is a data structure for four-sided range reporting on 
a z × z grid storing z points, with each point (i, j) indicating that the 
lexicographically ith reversed phrase is followed in S by the lexicographically 
jth suffix starting at a phrase boundary. Figure 4 shows the grid for our running 
example S = abaababaabaab. Kreft and Navarro use a wavelet tree, which takes 
O(z) space and answers queries in O((p + 1) log z) time, where p is the number 
of points reported [16]. Many other data structures are known for this problem, 
however, with different time-space tradeoffs. 

Figure 3: The Patricia trees for the reversed phrases (left) and suffixes starting at phrase 
boundaries (right) in the LZ77 parse of "abaababaabaab". 

Figure 4: A grid showing how, in the LZ77 parse of "abaababaabaab", reversed phrases precede 
suffixes starting at phrase boundaries. 

Their final component is (essentially) a data structure for two-sided range 
reporting on an n × n grid storing at most z - 1 points, with each point (i, j) 
indicating that S[i..j] is a phrase's source. The grid for S = abaababaabaab is 
shown beside the LZ77 parse in Figure 2. They implement this data structure 
with a compressed bitvector (as a predecessor data structure) and a range- 
minimum data structure, which take O(z log n) + o(n) bits of space and answer 
queries in O(p + 1) time, where p is again the number of points reported. Again, 
however, other time-space tradeoffs are available. 

Given a pattern P[1..m], Kreft and Navarro use the two Patricia trees to 
find, for 1 ≤ i ≤ m, the lexicographic range of the reverses of phrases ending 
with P[1..i], and the lexicographic range of the suffixes starting with P[i + 1..m] 
at phrase boundaries. This takes a total of O(m^2) time to descend the Patricia 
trees and O(m^2 d) time to extract nodes' path labels. They then use the wavelet 
tree to find all the phrase boundaries preceded by P[1..i] and followed by 
P[i + 1..m], which takes a total of O((m + occ) log z) time. After these steps, 
they know the locations of all occurrences of P that cross phrase boundaries in 
S, which are called primary occurrences. 
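
To make the definition of primary occurrences concrete, the following Python 
sketch tries every split point of P at every phrase boundary by brute force. It 
replaces the Patricia trees and the wavelet tree with direct string comparisons, 
so it illustrates only what is being computed, not the stated time bounds. The 
function name and the 0-based convention (each boundary is given as the index 
just past a phrase's last character) are assumptions of this example. 

```python
def primary_occurrences(s, phrase_ends, p):
    """Return sorted 0-based start positions of occurrences of p that cross a
    phrase boundary, i.e. that contain the last character of some phrase."""
    m = len(p)
    found = set()
    for e in phrase_ends:              # e = index just past a phrase's last character
        for i in range(1, m + 1):      # P[1..i] ends exactly at the boundary
            start = e - i
            if start >= 0 and s[start:e] == p[:i] and s[e:e + m - i] == p[i:]:
                found.add(start)       # the set removes duplicates found at several boundaries
    return sorted(found)
```

For S = abaababaabaab with phrases a, b, aa, bab, aabaa, b, the boundaries are 
`[1, 2, 4, 7, 12, 13]`, and the pattern aba has primary occurrences starting at 
0-based positions 0, 3 and 5; its fourth occurrence, at position 8, lies 
entirely inside the copied part of a phrase. 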

An occurrence of P that is completely contained within a phrase is called 
a secondary occurrence. By the definition of LZ77, the first occurrence must 
be primary and any secondary occurrence must be copied from an earlier 
occurrence. We can find all secondary occurrences by finding all primary 
occurrences and then recursively finding all phrase sources that cover 
occurrences we have already found. Notice that, if a phrase source S[i..j] 
covers an occurrence S[i'..i' + m - 1], then i ≤ i' and j ≥ i' + m - 1, so the 
point (i, j) is above and to the left of the point (i', i' + m - 1). It follows 
that, after finding all primary occurrences of P, Kreft and Navarro can find all 
secondary occurrences in O(occ) time using one two-sided range reporting query 
per occurrence. Therefore, their self-index supports pattern matching in a total 
of O(m^2 d + (m + occ) log z) time. 
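
The recursion from primary to secondary occurrences can be sketched as follows. 
This Python example scans all sources linearly where the text uses a two-sided 
range-reporting structure, so it shows the logic rather than the O(occ) bound. 
The triple layout (source start, source length, start of the copying phrase) and 
the hard-coded data for the running example are assumptions of this sketch, 
worked out by hand from the parse a, b, aa, bab, aabaa, b with 0-based 
positions. 

```python
# (source_start, source_length, phrase_start) for each phrase that copies
# something, for S = abaababaabaab (0-based positions).
SOURCES = [(0, 1, 2), (1, 2, 4), (2, 4, 7), (1, 1, 12)]


def all_occurrences(primary, sources, m):
    """Given the primary occurrences of a length-m pattern, add every
    occurrence copied (directly or transitively) by a covering source."""
    found = set(primary)
    stack = list(primary)
    while stack:
        o = stack.pop()
        for src_start, src_len, phrase_start in sources:
            # The source covers the occurrence S[o..o+m-1] iff it starts no
            # later and ends no earlier.
            if src_start <= o and o + m <= src_start + src_len:
                copy = phrase_start + (o - src_start)
                if copy not in found:
                    found.add(copy)
                    stack.append(copy)
    return sorted(found)
```

Starting from the primary occurrences 0, 3 and 5 of aba, the source of the 
phrase aabaa covers the occurrence at position 3, which yields the secondary 
occurrence at position 8. 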

We can use a new data structure by Chan, Larsen and Pătraşcu [17] for 
four-sided range reporting, instead of a wavelet tree, and a y-fast trie [18] 
for predecessor queries, instead of a compressed bitvector. Calculation shows 
that Kreft and Navarro's space bound then changes to O(z log log z) words and 
their time bound improves to O(m^2 d + m log log z + occ log log n). Bille and 
Gørtz [19] showed how, by storing one-dimensional range-reporting data 
structures at each node in the top log log z levels of the Patricia trees, we 
can eliminate the O(m log log z) term: if m ≤ log log z then instead of the data 
structure for four-sided range reporting, we can use the one-dimensional 
range-reporting data structures, which are faster; otherwise, the O(m^2) term 
dominates the O(m log log z) term anyway. Thus, by implementing the components 
differently in Kreft and Navarro's self-index, we obtain one that takes 
O(z log log z) space and supports pattern matching in O(m^2 d + occ log log n) 
time. 

If we are given an SLP for S with r rules then we can also combine Bille et 
al.'s [5] data structure with our modification of Kreft and Navarro's. We can 
use Bille et al.'s data structure for extracting nodes' path labels while 
pattern matching, so we obtain a self-index that takes O(r + z log log z) space 
and supports extraction in O(ℓ + log n) time and pattern matching in 
O(m^2 + m log n + occ log log n) time. In Section 3 we explain how to remove 
the O(m log n) term by taking advantage of the fact that, while pattern 
matching, we extract nodes' path labels only from phrase boundaries. 

3. Self-Indexing with an Unbalanced SLP 

Suppose we are given an SLP for S with r rules and a list of t specified 
positions from which we want to support linear-time extraction, e.g., from the 
phrase boundaries in the LZ77 parse. We can build an instance of Bille et 
al.'s [5] data structure and support extraction from any position in O(ℓ + log n) 
time, where ℓ is the length of the substring extracted. When ℓ = Ω(log n) we 
have O(ℓ + log n) = O(ℓ), i.e., the extraction is linear-time. Therefore, we need 
worry only about extracting substrings of length o(log n) from around the t 
specified positions. 

Consider each substring that starts log n characters to the left of a specified 
position and ends log n characters to the right of that position. By the definition 
of LZ77, the first occurrence of that substring crosses a phrase boundary. If we 
store a pointer to the first occurrence of each such substring, which takes O(t) 
space, then we need worry only about extracting substrings of length o(log n) 
from around the phrase boundaries. Now consider the string S'[1..n'] obtained 
from S by removing any character at distance more than log n from the nearest 
phrase boundary. Notice that S' can be parsed into O(z) substrings, each of 
which 

• occurs in S, 

• has length at most log n, 

• is either a single character or does not touch a phrase boundary in the 
LZ77 parse of S. 
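
The construction of S' can be sketched directly. In the Python example below, 
the function name, the 0-based boundary positions and the convention that the 
two characters adjacent to a boundary are at distance 1 from it are assumptions 
made for illustration; the text does not fix these details. With d set to about 
log n, the result has length at most 2dz, in line with the bound n' = O(z log n) 
used below. 

```python
def trim_far_characters(s, boundaries, d):
    """Keep only the characters of s within distance d of some boundary.

    Assumed convention: boundary e lies between s[e - 1] and s[e], and those
    two characters are at distance 1 from it. `boundaries` must be non-empty.
    """
    kept = []
    for p, ch in enumerate(s):
        dist = min((e - p) if p < e else (p - e + 1) for e in boundaries)
        if dist <= d:
            kept.append(ch)
    return "".join(kept)
```

For example, `trim_far_characters("abcdefghij", [5], 2)` keeps only the four 
characters around the single boundary. 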

We claim that any such substring S'[i..j] is split between at most 2 phrases 
in the LZ77 parse of S'. To see why, consider that the first copy of S'[i..j] 
in S must touch a phrase boundary and is completely within distance log n 
of that phrase boundary, so it remains intact in S'. Therefore, either S'[i..j] 
is a single character (which is obviously contained within only 1 phrase in 
the LZ77 parse of S') or S'[i..j] is not the first occurrence of that substring 
in S'. It follows that the LZ77 parse of S' consists of O(z) phrases. Clearly 
n' = O(z log n), so we can apply Rytter's algorithm to build a balanced SLP 
for S' that has r' = O(z log(n'/z)) = O(z log log n) rules. Since this SLP is 
balanced, its parse tree has height O(log z + log log n) and so we can store it in 
O(r') = O(z log log n) space and support extraction from any position in S' in 
O(ℓ + log n') = O(ℓ + log z) time. 

We now have a data structure that takes O(r + t + z log log n) space and 
supports extraction from any position in S in O(ℓ + log n) time and extraction 
from the t specified positions in O(ℓ + log z) time. If we choose the specified 
positions to be the phrase boundaries in the LZ77 parse of S, then we can 
combine it with our modification of Kreft and Navarro's index from Section 2 and 
obtain a self-index that takes O(r + z log log n) space and supports extraction 
in O(ℓ + log n) time and pattern matching in O(m^2 + m log z + occ log log n) 
time. We next eliminate the O(m log z) term by taking advantage of the fact 
that the SLP for S' is balanced. 

As noted in Section 1, an SLP is balanced if the corresponding parse tree is 
height- or weight-balanced. Suppose we are given a position i in S' and a bound 
L and asked to add O(1) space to our balanced SLP for S' such that we can 
support extraction of any substring S'[i..i + ℓ - 1] with ℓ ≤ L in O(ℓ + log L) 
time. Supporting extraction of any substring S'[i - ℓ + 1..i] in O(ℓ + log L) time 
is symmetric. We find the lowest common ancestor u of the ith and (i + L)th 
leaves of the parse tree T for S'. We then find the deepest node v in u's left 
subtree such that v's subtree contains the ith leaf of T, and the deepest node 
w in u's right subtree such that w's subtree contains the (i + L)th leaf. Since 
our SLP for S' is balanced, v and w have height O(log L). To see why, consider 
that the ancestors of the rightmost leaf in u's left subtree, and of the leftmost 
leaf in its right subtree, have exponentially many leaves in their height. 

Without loss of generality, assume v's subtree contains the ith leaf. We store 
the non-terminals at v and w in O(log r') bits, and O(log L) bits indicating the 
path from v to the ith leaf; together these take O(1) words. Figure 5 shows an 
example. We can view the symbols at nodes of T as pointers to those nodes and 
use the rules of the grammar to navigate in the tree. To extract S'[i..i + ℓ - 1], 
we start at v, descend to the ith leaf in T, and then traverse the leaves to the 
right until we have either reached the (i + ℓ - 1)st leaf in T or the rightmost 
leaf in v's subtree; in the latter case, we perform a depth-first traversal of 
w's subtree until we reach the (i + ℓ - 1)st leaf in T. During both traversals we 
output the terminal symbol at each leaf when we visit it. If we store the size of 
each non-terminal's expansion (i.e., the number of leaves in the corresponding 
subtree of the parse tree) then, after descending from v to the ith leaf, in 
O(log L) time we can compute a list of O(log ℓ) terminal and non-terminal symbols 
such that the concatenation of their expansions is S'[i..i + ℓ - 1]. This 
operation will prove useful in Section 4. 
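
The last operation, listing a few symbols whose expansions concatenate to a 
given substring, can be illustrated with a short Python sketch. It descends from 
the root using stored expansion sizes rather than from the stored nodes v and w, 
so it shows the decomposition itself and not the O(log L) bound; the grammar and 
the names X1..X7 are hypothetical and chosen to generate the running example. 

```python
# Hypothetical SLP for abaababaabaab; terminals are one-character strings.
RULES = {
    "X1": "b", "X2": "a",
    "X3": ("X2", "X1"), "X4": ("X3", "X2"), "X5": ("X4", "X3"),
    "X6": ("X5", "X4"), "X7": ("X6", "X5"),
}
# Expansion sizes, listed by hand for this small grammar.
SIZE = {"X1": 1, "X2": 1, "X3": 2, "X4": 3, "X5": 5, "X6": 8, "X7": 13}


def expand(rules, x):
    """Fully expand non-terminal x (used here only for checking)."""
    rhs = rules[x]
    return rhs if isinstance(rhs, str) else expand(rules, rhs[0]) + expand(rules, rhs[1])


def cover(rules, size, x, lo, hi):
    """List symbols whose expansions concatenate to expansion(x)[lo:hi],
    assuming 0 <= lo and hi <= size[x]."""
    if lo >= hi:
        return []
    if lo == 0 and hi == size[x]:
        return [x]                     # the whole subtree is inside the range
    left, right = rules[x]
    m = size[left]
    return (cover(rules, size, left, lo, min(hi, m))
            + cover(rules, size, right, max(lo - m, 0), hi - m))
```

For example, `cover(RULES, SIZE, "X7", 3, 8)` returns the two symbols X3 and X4, 
whose expansions ab and aba concatenate to positions 3 to 7 of the string. 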

Since we can extract any substring S'[i..i + ℓ - 1] in O(ℓ + log L) time and 
extracting any substring S'[i - ℓ + 1..i] in O(ℓ + log L) time is symmetric, we 
can extract any substring of length ℓ that crosses position i in S' in 
O(ℓ + log L) time. We can already extract any substring in O(ℓ + log n') time, 
so we first choose L = log n' and store O(1) words to be able to extract any 
substring that crosses position i in O(ℓ + log log n') time. We then choose 
L = log log n' and store another O(1) words to be able to extract any such 
substring in O(ℓ + log log log n') time. After log* n' iterations, we have 
stored O(log* n') words and can extract any such substring in O(ℓ) time. 

Figure 5: To support fast extraction from position i, we store the non-terminals at v and w 
and the path from v to the ith leaf in the parse tree. 

Lemma 1. Given a balanced SLP for a string S'[1..n'] and a specified position 
in S', we can add O(log* n') words to the SLP such that, if a substring of length 
ℓ crosses that position, then we can extract that substring in O(ℓ) time. 

Applying Lemma 1 to each of the positions in S' of the phrase boundaries 
in the LZ77 parse of S, then combining the resulting data structure with our 
instance of Bille et al.'s data structure for S, we obtain the following corollary. 

Corollary 2. Given an SLP for S with r rules and a list of t specified positions, 
we can store S in O(r + t + z log log n) space such that, if a substring of length 
ℓ crosses a specified position, then we can extract that substring in O(ℓ) time. 

Applying Corollary 2 to S and choosing the t specified positions to be the 
phrase boundaries in the LZ77 parse, we obtain a data structure that takes 
O(r + z log log n) space and supports extraction in O(ℓ + log n) time and 
extraction from around phrase boundaries in O(ℓ) time. Combining that with our 
modification of Kreft and Navarro's self-index from Section 2, we obtain our 
first main result. 



Theorem 3. Given a straight-line program with r rules for a string S[1..n] 
whose LZ77 parse consists of z phrases, we can store a self-index for S in 
O(r + z log log n) space such that we can extract any substring of length ℓ in 
O(ℓ + log n) time and, given a pattern P[1..m], we can list the occ occurrences 
of P in S in O(m^2 + occ log log n) time. 

We note that this self-index supports fast circular pattern matching (see, 
e.g., [20]), for which we want to find all the cyclic shifts P[j + 1..m]P[1..j] of 
P that occur in S. Listing the occurrences can be handled in the same way as 
listing occurrences of P, so we ignore that subproblem here. We modify our 
searching algorithm such that, when we would search in the first Patricia tree 
for the reverse (P[1..i])^R of a prefix of P and in the second Patricia tree for the 
corresponding suffix P[i + 1..m], we instead search for (P[i + 1..m]P[1..i])^R and 
P[i + 1..m]P[1..i], respectively. We record which nodes we visit in the Patricia 
trees and, when we stop descending (possibly because there is no edge whose 
label starts with the correct character), we extract the path label for the last 
node we visit in either tree and compute how many nodes' path labels match 
prefixes of (P[i + 1..m]P[1..i])^R and P[i + 1..m]P[1..i]. 

For each node v we visit in the first Patricia tree whose path label matches a 
prefix of (P[i + 1..m]P[1..i])^R, we find the first node w (if one exists) that we 
visit in the second Patricia tree whose path label matches a prefix of 
P[i + 1..m]P[1..i] and such that the sum of the lengths of the path labels of v 
and w is at least m. For each such pair, we perform a range-emptiness query 
(i.e., a range-reporting query that we stop early, determining only whether 
there are any points in the range) to check whether there are any phrase 
boundaries that are immediately preceded by the reverse of v's path label and 
immediately followed by w's path label. These phrase boundaries are precisely 
those that are crossed by cyclic shifts of P with the boundary between P[i] and 
P[i + 1]. This takes a total of O(m^2 log log z) time. 

A similar idea works for finding the maximal substrings of P that occur in 
S. For each 1 ≤ i ≤ m, we can use doubling search (with a range-emptiness 
query at each step) to find the longest suffix P[h..i] of P[1..i] such that some 
phrase boundary is immediately preceded by P[h..i] and immediately followed 
by P[i + 1]. We then use doubling search to find the longest prefix P[i + 1..j] of 
P[i + 1..m] such that some phrase boundary is immediately preceded by P[h..i] 
and immediately followed by P[i + 1..j]. Notice that P[h..j] is the leftmost 
maximal substring of P crossing a phrase boundary at position i, and we record 
it as a candidate maximal substring of P occurring in S. 
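
The doubling search used here can be sketched generically. The Python function 
below finds the largest k for which a monotone predicate holds, using a number 
of predicate calls logarithmic in the answer; in the index, each predicate call 
would be a range-emptiness query, whereas the usage note that follows 
substitutes a brute-force string test. The function name and interface are 
assumptions of this example. 

```python
def doubling_search(pred, limit):
    """Return the largest k in [0, limit] with pred(k) true, assuming pred(0)
    holds and pred is monotone (once false, it stays false)."""
    if limit == 0 or not pred(1):
        return 0
    lo, hi = 1, 2                      # invariant: pred(lo) is true
    while hi <= limit and pred(hi):    # double until the predicate fails
        lo, hi = hi, 2 * hi
    hi = min(hi, limit + 1)            # pred(hi) is false or hi is out of range
    while hi - lo > 1:                 # binary search between lo and hi
        mid = (lo + hi) // 2
        if pred(mid):
            lo = mid
        else:
            hi = mid
    return lo
```

For instance, with s = abaababaabaab, phrase ends `[1, 2, 4, 7, 12, 13]`, 
p = xbab and i = 4, the call 
`doubling_search(lambda k: any(s[:e].endswith(p[i - k:i]) for e in ends), i)` 
reports that the longest suffix of p[:i] immediately preceding a phrase boundary 
has length 3. 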

We now use doubling search to find the longest suffix P[h'..i] of P[h + 1..i] 
such that some phrase boundary is immediately preceded by P[h'..i] and 
immediately followed by P[i + 1..j + 1], then we use doubling search to find the 
longest prefix P[i + 1..j'] of P[i + 1..m] such that some phrase boundary is 
immediately preceded by P[h'..i] and immediately followed by P[i + 1..j']. Notice 
that h' > h, j' > j and P[h'..j'] is the second maximal substring of P crossing 
a phrase boundary at position i, and we record it as another candidate maximal 
substring of P occurring in S. 

We repeat this procedure until we have recorded all the candidate maximal 
substrings crossing a phrase boundary at position i. While we work, the left 
endpoints of the suffixes and right endpoints of the prefixes we consider do 
not move left, so we use a total of O(m log log z) time to find the candidates 
associated with each position i. Since two candidates associated with the same 
position cannot contain each other, there are at most m of them. Once we have 
all the candidates for every position i, finding the true maximal substrings of P 
that occur in S takes O(m^2) time. In total we use O(m^2 log log z) time. 

Corollary 4. Given a pattern P[1..m], we can use the self-index described in 
Theorem 3 to compute in O(m^2 log log z) time all the cyclic shifts and maximal 
substrings of P that occur in S. 

4. Self-Indexing with a Balanced SLP 

In this section we describe how, if the SLP we are given for S happens to be 
balanced, then we can improve the time bound in Theorem 3 using Karp-Rabin 
hashes [21]. A Karp-Rabin hash function 

    f(T) = (T[1] σ^(ℓ-1) + T[2] σ^(ℓ-2) + ... + T[ℓ]) mod q 

maps strings to numbers, where σ is the size of the alphabet, q is a prime 
and we interpret each character T[j] as a number between 0 and σ - 1. If we 
choose q uniformly at random from among the primes at most n^c then, for 
any two distinct strings T[1..ℓ] and T'[1..ℓ'] with ℓ, ℓ' ≤ n, the probability that 
f(T) = f(T') is O(c log σ / n^(c-1)). Therefore, we can use Karp-Rabin hashes 
that fit in O(1) words with almost no chance of collisions. Notice that, once 
we have computed the Karp-Rabin hashes of all the prefixes of a string, we can 
compute the Karp-Rabin hash of any substring in O(1) time. Moreover, given 
the Karp-Rabin hashes of two strings, we can compute the Karp-Rabin hash of 
their concatenation in O(1) time. 

Consider the problem of finding the lexicographic range of the suffixes 
starting with P[i + 1..m] at phrase boundaries. The problem of finding the 
lexicographic range of reversed phrases ending with P[1..i] is symmetric. 
Suppose we augment the Patricia tree for the suffixes by storing at each node u 
the Karp-Rabin hash of u's path label. This takes O(z) extra space and, assuming 
our Karp-Rabin hash causes no collisions and we have already computed the 
Karp-Rabin hashes of all the prefixes of P, lets us find the deepest node v 
whose path label is a prefix of P[i + 1..m] in time proportional to v's depth. In 
the worst case, however, v's depth could be as large as m - i. Fortunately, while 
studying a related problem, Belazzougui, Boldi, Pagh and Vigna [22, 23] showed 
how, by storing one Karp-Rabin hash for each edge, we can use a kind of binary 
search to find v in O(log m) time. Ferragina [24] gave a somewhat simpler 
solution in which he balanced the Patricia tree by a centroid decomposition. His 
solution also takes O(z) space but with O(log z) search time. 

If the length of v's path label is exactly m - i then, again assuming our 
Karp-Rabin hash causes no collisions, v's path label is P[i + 1..m]. Otherwise, 
v's path label is a proper prefix of P[i + 1..m] and in O(1) time we can find 
the edge descending from v (if one exists) whose label begins with the next 
character of P[i + 1..m]. Let w be the child of v below this edge. If the length 
of w's path label is at most m - i then we know by our choice of v that no suffix 
starts at a phrase boundary with P[i + 1..m]. Assume w's path label has length 
at least m - i + 1. If any suffix starts at a phrase boundary with P[i + 1..m], 
then those that do correspond to the leaves in w's subtree. We cannot determine 
from looking at Karp-Rabin hashes stored in the Patricia tree, however, whether 
there are any such suffixes. In order to determine this, we use the balanced SLP 
to compute the Karp-Rabin hash of the first m - i characters of w's path label. 
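
The two Karp-Rabin properties this section relies on (constant-time substring 
hashes from prefix hashes, and constant-time concatenation) can be checked with 
a small Python sketch. The modulus below is a fixed prime chosen only for 
illustration, whereas the text draws q at random; the alphabet size of 256, the 
function names and the precomputed table of powers of σ are likewise assumptions 
of this example. 

```python
SIGMA = 256              # assumed alphabet size: characters with code below 256
Q = (1 << 61) - 1        # a fixed prime, for illustration only


def kr_hash(t):
    """Karp-Rabin hash of t: the base-SIGMA value of t, taken modulo Q."""
    h = 0
    for ch in t:
        h = (h * SIGMA + ord(ch)) % Q
    return h


def prefix_tables(t):
    """Return (prefix, power): prefix[k] is the hash of t[:k] and power[k] is
    SIGMA**k mod Q."""
    prefix, power = [0], [1]
    for ch in t:
        prefix.append((prefix[-1] * SIGMA + ord(ch)) % Q)
        power.append(power[-1] * SIGMA % Q)
    return prefix, power


def substring_hash(prefix, power, a, b):
    """Hash of t[a:b] in O(1) time from the prefix hashes."""
    return (prefix[b] - prefix[a] * power[b - a]) % Q


def concat_hash(hx, hy, len_y, power):
    """Hash of XY in O(1) time from the hashes of X and Y and the length of Y."""
    return (hx * power[len_y] + hy) % Q
```

After `prefix, power = prefix_tables(p)` for a pattern p, any call 
`substring_hash(prefix, power, a, b)` agrees with `kr_hash(p[a:b])`, which is 
what lets the search compare pattern substrings against stored hashes without 
rescanning P. 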

Recall from Section 3 that, given a balanced SLP with r rules for a string 
S, a specified position i in S and a bound L, we can add O(1) space such 
that later, given a length ℓ ≤ L, in O(log L) time we can compute a list of 
O(log ℓ) terminal and non-terminal symbols such that the concatenation of their 
expansions is S[i..i + ℓ - 1]. (This is the same information we store to extract 
S[i..i + ℓ - 1].) It follows that, if we store the Karp-Rabin hash of the expansion 
of every non-terminal symbol, then we can compute the Karp-Rabin hash of 
S[i..i + ℓ - 1] in O(log L) time. Symmetrically, we can add O(1) space such 
that we can compute in O(log L) time the Karp-Rabin hash of any substring of 
length at most L that ends at position i. Therefore, we can add O(1) space such 
that we can compute in O(log L) time the Karp-Rabin hash of any substring of 
length at most L that crosses position i in S. As long as L is polynomial in the 
length ℓ of the substring whose Karp-Rabin hash we want, log L = O(log ℓ). If we 
fix a constant ε with 0 < ε < 1 and apply this construction with L set to each 
of the O(log log n) values n, n^ε, n^(ε^2), n^(ε^3), ..., 2, then we obtain the 
following result. 

Lemma 5. Given a balanced SLP for a string S[1..n] and a specified position in 
S, we can add O(log log n) words to the SLP such that, if a substring of length 
ℓ crosses that position, then we can compute its Karp-Rabin hash in O(log ℓ) 
time. 



Applying Lemma 5 to each of the phrase boundaries in the LZ77 parse of S,
we obtain the following corollary.

Corollary 6. Given a balanced SLP for S with r rules, we can store S in
O(r + z log log n) space such that, if a substring of length ℓ crosses a phrase
boundary, then we can compute its Karp-Rabin hash in O(log ℓ) time.
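
The log log n factor in Lemma 5 and Corollary 6 comes from the number of bounds
L kept per position. A small sketch, assuming ε = 1/2 by default and using
hypothetical helper names, lists the bounds and picks the one to use for a
given substring length:

```python
import math


def scales(n, eps=0.5):
    """Bounds n, n^eps, n^(eps^2), ..., ending with 2 (rounded up to integers)."""
    assert n >= 2 and 0 < eps < 1
    exponent = math.log2(n)      # work with log2 of the bound
    bounds = []
    while exponent > 1:
        bounds.append(math.ceil(2 ** exponent))
        exponent *= eps          # raising the bound to the power eps
    bounds.append(2)
    return bounds


def pick_scale(length, bounds):
    """Smallest stored bound L that is at least the requested length."""
    return min(L for L in bounds if L >= length)
```

For n = 2^16 this yields the five bounds 65536, 256, 16, 4 and 2. A substring of
length 100 is served by the bound 256, so log L stays within roughly a factor
1/ε of log ℓ, which is why the query time can be stated in terms of ℓ alone.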

Combining Corollary 6 with Belazzougui et al.'s construction, we obtain
a data structure that takes O(r + z log log n) space and allows us to find in
O(log m) time the lexicographic range of the suffixes starting with P[i + 1..m]
at phrase boundaries, assuming our Karp-Rabin hash causes no collisions and
we have already computed the Karp-Rabin hashes of all the prefixes of P. Since
computing the Karp-Rabin hashes of all the prefixes of P takes O(m) time and
we need do it only once, it follows that we can find in a total of O(m log m) time
the lexicographic range of the suffixes starting with P[i + 1..m] for every value
of i and, symmetrically, the lexicographic range of the reversed prefixes ending
with P[1..i]. Combining this data structure with Lemma[Tj, we can also support
extraction in O(log n + ℓ) time and extraction from around phrase boundaries
in O(ℓ) time. Combining this data structure with our modification of Kreft and
Navarro's self-index from Section 2, we obtain our second main result, below,
except that our search time is O(m log m + m log log z + occ log log n) instead of
O(m log m + occ log log n).

Theorem 7. Given a balanced straight-line program with r rules for a string
S[1..n] whose LZ77 parse consists of z phrases, we can store a self-index for S
in O(r + z log log n) space such that we can extract any substring of length ℓ in
O(ℓ + log n) time and, given a pattern P[1..m], we can list the occ occurrences
of P in S in O(m log m + occ log log n) time. Our construction is randomized
but, given any constant c, we can bound by 1/n^c the probability that we build a
faulty index.
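
The O(m)-time preprocessing of P used above can be sketched as follows: once
the hashes of all prefixes of P are known, the hash of any P[i + 1..m] follows
in constant time. The modulus, base and function names are assumptions made for
illustration, and indices in the code are 0-based and half-open.

```python
MOD = (1 << 61) - 1   # assumed modulus (a Mersenne prime)
BASE = 911382323      # assumed base; in practice it would be chosen at random


def kr_hash(s):
    """Karp-Rabin hash of s computed directly, for comparison."""
    h = 0
    for ch in s:
        h = (h * BASE + ord(ch)) % MOD
    return h


def prefix_hashes(p):
    """Return (pre, pw) with pre[k] = hash of p[:k] and pw[k] = BASE^k mod MOD."""
    pre, pw = [0], [1]
    for ch in p:
        pre.append((pre[-1] * BASE + ord(ch)) % MOD)
        pw.append(pw[-1] * BASE % MOD)
    return pre, pw


def substring_hash(pre, pw, i, j):
    """Hash of p[i:j] in constant time from the prefix hashes."""
    return (pre[j] - pre[i] * pw[j - i]) % MOD
```

With pre, pw = prefix_hashes(P), the hash of the suffix P[i + 1..m] is
substring_hash(pre, pw, i, len(P)) and the hash of the prefix P[1..i] is pre[i].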
Unfortunately, this time we cannot use Bille and Gørtz' [TU] approach alone
to eliminate the O(m log log z) term. When m < log log z, storing one-dimensional
range-reporting data structures at nodes in the top log log z levels of the
Patricia trees means we use O(r + z log log n) space and O(m log m + occ log log n)
search time; when m > log z, the O(m log m) term dominates the O(m log log z)
term anyway. To deal with the case log log z < m < log z, we build a Patricia
tree for the set of O(z log z) substrings of S that cross a phrase boundary, start
at most log z characters before the first phrase boundary they cross, and end
exactly log z characters after it (or at S[n], whichever comes first). At the leaf
corresponding to each such substring, we store O(log log z) bits indicating the
position in the substring where it first crosses a phrase boundary. In total this
Patricia tree takes O(z log log z) words.
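
To make the set of stored substrings concrete, here is a rough sketch that
enumerates them for a given list of phrase boundaries. It uses a plain
dictionary as a stand-in for the Patricia tree, so lookups here are not O(m).
The function names, the window parameter w (playing the role of log z) and the
convention that a boundary b lies between s[b - 1] and s[b] are assumptions of
this sketch, as is the filter that keeps only split points strictly inside the
pattern.

```python
from collections import defaultdict


def build_boundary_index(s, boundaries, w):
    """Map each stored substring to the offsets at which it first crosses a boundary.

    For every boundary b, the stored substrings start at most w characters
    before b (but not before the previous boundary, so that b is the first
    boundary crossed) and end w characters after b or at the end of s.
    """
    index = defaultdict(set)
    prev = 0
    for b in sorted(boundaries):
        end = min(b + w, len(s))
        for start in range(max(prev, b - w), b):
            index[s[start:end]].add(b - start)
        prev = b
    return index


def split_points(index, p):
    """Offsets i such that some stored substring starts with p and crosses at i."""
    found = set()
    for key, offsets in index.items():
        if key.startswith(p):
            found |= {off for off in offsets if off < len(p)}
    return found
```

For instance, with s = "abaababaabaab", the illustrative boundaries
[1, 2, 3, 5, 8] (not computed from an actual LZ77 parse) and w = 2, the pattern
"abab" has the single split point 2, which is the kind of position i that the
search described next verifies against the first two Patricia trees.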

If log log z < m < log z, we search for P in this new Patricia tree, which
takes O(m) time. Suppose our search ends at node v. If P occurs in S, then
the leaves in v's subtree store the distinct positions in P's primary occurrences
where they cross phrase boundaries. To determine whether P occurs in S, it
suffices for us to choose any one of those positions, say i, and check whether
there is a phrase boundary immediately preceded by P[1..i] and immediately
followed by P[i + 1..m]. To do this, we search in our first two augmented Patricia
trees and perform a range-emptiness query. If m < log log z then we can
perform the range-emptiness query with the one-dimensional range-reporting
data structures in O(1) time; otherwise, we perform the range-emptiness query
with our data structure for four-sided range reporting in O(log log z) ⊆ O(m)
time. If we learn that P does not occur in S, then we stop here, having used a
total of O(m) time. If we learn that P does occur in S, then in O(occ) time we
traverse v's subtree to obtain the full list of distinct positions in P's primary
occurrences where they first cross phrase boundaries. For each such position, we
search in our first two augmented Patricia trees and perform a range-reporting
query. This takes O(m log m + occ log log z) time and gives us the positions of
all P's primary occurrences in S.






5. Future Work 



We are currently working on a practical implementation of our self-index. 
We believe the most promising avenue is to use Maruyama, Sakamoto and 
Takeda's |5 algorithm to build a balanced SLP, a wavelet tree as the range- 
reporting data structure [551 US] and Ferragina's jM] restructuring to balance 
the Patricia trees. When m < z (which is the case of most interest for many
applications in bioinformatics), this implementation should take O(z) space
and support location of all occ_1 primary occurrences in O((m + occ_1) log z)
time, with reasonable coefficients. As we have explained, finding all secondary
occurrences is relatively easy once we have found all the primary occurrences. 

Approximate pattern matching is often more useful than exact pattern
matching, especially in bioinformatics. Fortunately, Russo, Navarro, Oliveira
and Morales [26] showed how to support practical approximate pattern matching
using indexes for exact pattern matching, and we believe most of their
techniques are applicable to our self-index. One potential problem is how to perform
backtracking using Patricia trees augmented with Karp-Rabin hashes, without 
storing or extracting edge labels. This is because comparing hashes tells us 
(with high probability) when strings differ, but it does not tell us by how much 
they differ. We are currently investigating a new variant of Karp-Rabin hashes 
by Policriti, Tomescu and Vezzi [27] that roughly preserves Hamming distance.

Finally, we have shown elsewhere [28] that supporting extraction from
specified positions has applications to, e.g., sequential approximate pattern
matching. In that paper we developed a different data structure to support such
extraction, which we have now implemented and found to be faster and more
space-efficient
than Kreft and Navarro's solutions. Nevertheless, we expect the solutions we 
have given here to be even better. 